A significant number of hotel bookings are called off due to cancellations or no-shows. Typical reasons include changes of plans and scheduling conflicts. Canceling is often made easy by the option to do so free of charge, or preferably at a low cost. This is convenient for hotel guests, but it is a less desirable and potentially revenue-diminishing factor for hotels to deal with. Losses are particularly high for last-minute cancellations.
The new technologies involving online booking channels have dramatically changed customers’ booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.
The cancellation of bookings impacts a hotel on various fronts:
The increasing number of cancellations calls for a machine-learning-based solution that can help predict which bookings are likely to be canceled. Star Hotels Group, which operates a chain of hotels in Portugal, is facing a high number of booking cancellations and needs a data-driven solution. In this report, we analyze the provided data to find which factors have a strong influence on booking cancellations, build a predictive model that can predict in advance which bookings will be canceled, and help formulate profitable cancellation and refund policies.
The provided dataset contains the following columns:
- no_of_adults: Number of adults
- no_of_children: Number of children
- no_of_weekend_nights: Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
- no_of_week_nights: Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
- type_of_meal_plan: Type of meal plan booked by the customer
- required_car_parking_space: Does the customer require a car parking space? (0 - No, 1 - Yes)
- room_type_reserved: Type of room reserved by the customer. The values are ciphered (encoded) by Star Hotels Group
- lead_time: Number of days between the date of booking and the arrival date
- arrival_year: Year of arrival date
- arrival_month: Month of arrival date
- arrival_date: Date of the month
- market_segment_type: Market segment designation
- repeated_guest: Is the customer a repeated guest? (0 - No, 1 - Yes)
- no_of_previous_cancellations: Number of previous bookings that were canceled by the customer prior to the current booking
- no_of_previous_bookings_not_canceled: Number of previous bookings not canceled by the customer prior to the current booking
- avg_price_per_room: Average price per day of the reservation; prices of the rooms are dynamic (in euros)
- no_of_special_requests: Total number of special requests made by the customer (e.g., high floor, view from the room, etc.)
- booking_status: Flag indicating if the booking was canceled or not
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pylab
import scipy.stats as stats
#Removes the limit from the number of displayed columns and rows.
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 200)
#Using plotly for specific plots of categorical variables
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
#Add a nice background to graphs and show graphs in the notebook
sns.set(color_codes=True)
%matplotlib inline
#Function to randomly split the data into train data and test data
from sklearn.model_selection import train_test_split
#To build logistic regression_model using sklearn
from sklearn.linear_model import LogisticRegression
#To build logistic regression_model using statsmodels
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant
#!pip install -U scikit-learn --user
# Libraries to build decision tree classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# To tune different decision tree models
from sklearn.model_selection import GridSearchCV
# To get diferent metric scores
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
plot_confusion_matrix,
make_scorer,
roc_auc_score,
precision_recall_curve,
roc_curve,
)
#To change numeric month to month name
import calendar
# Library to suppress warnings or deprecation notes
import warnings
warnings.filterwarnings("ignore")
#importing the DataFrame from "StarHotelsGroup.csv"
data=pd.read_csv('C:/Users/Adis/Desktop/Data Science/Project 4- Classification/StarHotelsGroup.csv')
data.head()
print(f'There are {data.shape[1]} columns and {data.shape[0]} rows in the data set.') # f-string
Let us take a look at the imported data and the summary of different columns:
data.info()
Four of the columns represent categorical (qualitative) variables:
The other 14 columns represent quantitative variables:
Now we check for missing values in the data. The number of missing values in each column of the imported data is shown below:
data.isnull().sum()
We can see that there are no missing values in our dataframe.
Questions:
1. What are the busiest months in the hotel?
First, we convert the integer months to month names.
Next, we tabulate frequencies of each arrival month as below:
# making a data frame to show the month name and frequency
df_month=data['arrival_month'].value_counts().to_frame()
df_month.rename(columns={'arrival_month': 'frequencies'},inplace=True)
df_month.reset_index(inplace=True)
df_month = df_month.rename(columns = {'index':'arrival month'})
#we convert month int to month name
df_month['arrival month'] = df_month['arrival month'].apply(lambda x: calendar.month_abbr[x])
df_month
The graph below shows the frequency of each month in descending order.
sns.set_style("whitegrid")
fig = plt.figure(figsize=(15, 4));
# Adds subplot on position 1
fig.add_subplot(121)
# plot the barchart
ax = data['arrival_month'].value_counts().plot(kind="bar", rot=90)
# Make twin axis
ax2 = ax.twinx()
# display counts on each bar
for p in ax.patches:
    ax.annotate('{}'.format(p.get_height()), (p.get_x() - 0.1, p.get_height() + 10), fontsize=12, weight='bold')
#adding labels
ax.set(xlabel='arrival month', ylabel='count');
Observations:
2. Which market segment do most of the guests come from?
data['market_segment_type'].value_counts()
sns.set_style("darkgrid")
fig = plt.figure(figsize=(15, 4));
# Adds subplot on position 1
fig.add_subplot(121)
# plot the barchart
ax = data['market_segment_type'].value_counts().plot(kind="bar", rot=90)
# display counts on each bar
for p in ax.patches:
    ax.annotate('{}'.format(p.get_height()), (p.get_x() - 0.1, p.get_height() + 10), fontsize=12, weight='bold')
#adding labels
ax.set(xlabel='market segment type', ylabel='count');
Observations:
3. Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?
graph=sns.catplot(data=data, x="avg_price_per_room", y="market_segment_type",
kind="box", height=4, aspect=2);
#adding labels
graph.set(xlabel='average price per room', ylabel='market segment');
Observations:
graph=sns.catplot(data=data, x="avg_price_per_room", y="market_segment_type", hue='room_type_reserved',
kind="box", height=4, aspect=2);
#adding labels
graph.set(xlabel='average price per room', ylabel='market segment');
Price changes are highest for combination of room type 7 with Offline and Corporate market segments.
4. What percentage of bookings are canceled?
print('{}% of bookings are canceled.'.format(round(data['booking_status'].value_counts(normalize=True).mul(100)[1],2)))
5. Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?
print('Only {}% of repeating guests cancel their booking. Hence, we can conclude that most of the cancelations are done by non-repeating guests.'.format(round(data[data['repeated_guest']==1]['booking_status'].value_counts(normalize=True).mul(100)[1],2)))
6. Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?
#plotting
graph=sns.catplot(data=data, y="no_of_special_requests", x="booking_status",
kind="violin", height=4, aspect=2);
#adding labels
graph.set(xlabel='booking status', ylabel='number of special requests');
#defining the color palette
color=sns.set_palette("Set3");
#plotting
graph = sns.FacetGrid(data, col="no_of_special_requests", hue='booking_status', palette=color)
graph.map(sns.histplot, "booking_status" );
#adding labels
graph.set(xlabel='booking status', ylabel='count');
#plotting
graph = sns.FacetGrid(data, col="booking_status", hue='no_of_special_requests')
graph.map(sns.histplot, "no_of_special_requests" );
#adding labels
graph.set(xlabel='number of special requests', ylabel='count');
Observations:
data.describe().T
Observations:
The spread of no_of_previous_bookings_not_canceled seems to be very high: it ranges from 0 to 72, while its median is 0, which does not seem sensible.
The spread of avg_price_per_room also seems high: it ranges from 0 to 540, while its median is 105.
#plotting heat map
df_corr=data.corr()
fig = plt.figure(figsize=(10, 8));
# color map
cmap = sns.diverging_palette(0, 230, 90, 60, as_cmap=True)
# plot heatmap
ax=sns.heatmap(df_corr, annot=True, fmt=".2f", linewidths=5, cmap=cmap, vmin=-1, vmax=1, square=True);
plt.title('Correlation heat map for the entire data');
fig.tight_layout()
Observations:
Let us check values of each categorical variable:
# looking at value counts for non-numeric features
num_to_display = 12 # defining number of displayed levels for each non-numeric feature
for colname in data.dtypes[data.dtypes == 'object'].index:
    val_counts = data[colname].value_counts(dropna=False)  # show NA counts
    print(f'\n\ncategorical variable = {colname}')  # f-string
    if len(val_counts) > num_to_display:
        print(f'Only displaying the first {num_to_display} of {len(val_counts)} values.\n')  # f-string
    print(val_counts.iloc[:num_to_display])
Now, let us check the quantitative variables:
def histogram_boxplot(data, feature, figsize=(10, 5), kde=False, bins=None):
    """
    Boxplot and histogram combined
    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (10, 5))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot; a star indicates the mean value of the column
    # histogram, with an explicit bin count if one was given
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # add median to the histogram
histogram_boxplot(data, "lead_time", bins=70)
Lead time values are mostly between 0 and 150 days. There seem to be many outliers in this variable.
histogram_boxplot(data, "no_of_previous_bookings_not_canceled", bins=70)
Most of the guests do not have any previous bookings that were not canceled. This may be because they have never booked a room at Star Hotels before.
histogram_boxplot(data, "avg_price_per_room",)
avg_price_per_room appears to follow an approximately normal distribution.
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top
    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 4))
    else:
        plt.figure(figsize=(n + 2, 4))
    plt.xticks(rotation=90, fontsize=12)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )
    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # x position: center of the bar
        y = p.get_height()  # y position: top of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the count/percentage
    plt.show()  # show the plot
# function to create labeled barplots
def labeled_barplot_hue(data, feature, hue, perc=False, n=None):
    """
    Barplot with percentage at the top, split by a hue variable
    data: dataframe
    feature: dataframe column
    hue: dataframe column used to color the bars
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 4))
    else:
        plt.figure(figsize=(n + 2, 4))
    plt.xticks(rotation=90, fontsize=12)
    ax = sns.countplot(
        data=data,
        x=feature,
        hue=hue,
        palette="husl",
        order=data[feature].value_counts().index[:n].sort_values(),
    )
    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # x position: center of the bar
        y = p.get_height()  # y position: top of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the count/percentage
    plt.show()  # show the plot
labeled_barplot(data, "type_of_meal_plan", perc=True)
The most frequent meal plan is Plan 1, with a 74.4% frequency. Plan 3 is not desirable for guests.
labeled_barplot_hue(data, "type_of_meal_plan", "booking_status", perc=True)
No apparent connection is detected between meal plan 1 and cancellations.
labeled_barplot(data, "room_type_reserved", perc=True)
Room type 1 is the room most preferred by guests.
labeled_barplot_hue(data, "room_type_reserved", "booking_status", perc=True)
Cancellations and room types do not seem to have any specific correlation.
labeled_barplot_hue(data, "market_segment_type", "booking_status", perc=True)
We divide the list of numerical variables into two groups to draw their pair plots in a way that is easier to visualize.
# numerical columns
columns=data.dtypes[data.dtypes != 'object'].index
columns1=columns[0:7]
columns2=columns[7:]
sns.pairplot(data=data[columns1]);
sns.pairplot(data=data[columns2]);
data.isnull().sum()
There are no missing values in the dataframe.
Now, we want to figure out whether we have duplicate data and how to deal with them.
# Count the number of non-duplicates
(~data.duplicated()).sum()
# Count the number of duplicates in rows
data.duplicated().sum()
print('Among the {} rows of the dataframe {} rows are unique. But, {} rows are duplicates.'.format(data.shape[0] , (~data.duplicated()).sum(), data.duplicated().sum()))
To treat these duplicates, we drop all duplicates except for the first occurrence in the dataframe.
data.drop_duplicates(inplace=True);
print('In the revised dataframe, there are a total of {} rows and the number of duplicated rows equals {}.'.format(data.shape[0] , data.duplicated().sum()))
Histogram of the no_of_previous_bookings_not_canceled is shown below.
fig = plt.figure(figsize=(40, 4));
# Adds subplot on position 1
fig.add_subplot(121)
# plot the barchart
ax = data['no_of_previous_bookings_not_canceled'].value_counts().plot(kind="bar", rot=90)
# Make twin axis
ax2 = ax.twinx()
# display counts on each bar
for p in ax.patches:
    ax.annotate('{}'.format(p.get_height()), (p.get_x() - 0.1, p.get_height() + 10), fontsize=12, weight='bold')
#adding labels
ax.set(xlabel='no of previous bookings not canceled', ylabel='count');
print('The minimum value of variable "no_of_previous_bookings_not_canceled" is {} and the maximum amount is {}.'.format(data['no_of_previous_bookings_not_canceled'].min(),data['no_of_previous_bookings_not_canceled'].max()))
As we can see, the most frequent value of no_of_previous_bookings_not_canceled is zero (with a frequency of 41314). This means most customers have no previous non-canceled bookings prior to the current one, most likely because they are first-time guests. Furthermore, the frequencies of values other than 0 are very small for this variable. For instance, the 10 most frequent values are shown below:
data['no_of_previous_bookings_not_canceled'].value_counts().head(10)
Graph below shows frequency of all other values of this variable except for 0.
#defining the color palette
color=sns.set_palette("dark")
#plotting
fig = plt.figure(figsize=(40, 8));
ax=sns.histplot(x='no_of_previous_bookings_not_canceled', data=data[data['no_of_previous_bookings_not_canceled']>0] , hue='booking_status', palette=color);
We can see that although the values of this feature reach 72, the frequencies drop sharply after 0.
temp=data[data['no_of_previous_bookings_not_canceled']==0]['booking_status'].value_counts()
print('Moreover, when "no_of_previous_bookings_not_canceled" = 0, {} of the current bookings have been canceled and {} have not.\n'.format(temp[1],temp[0]) )
temp=data[data['no_of_previous_bookings_not_canceled']>0]['booking_status'].value_counts()
print('When "no_of_previous_bookings_not_canceled" > 0, only {} of the current bookings have been canceled, while the other {} have not.\n'.format(temp[1],temp[0]) )
temp=data[data['no_of_previous_bookings_not_canceled']>12]['booking_status'].value_counts()
print('Interestingly, when "no_of_previous_bookings_not_canceled" > 12, none of the {} current bookings have been canceled.'.format(temp[0]) )
Hence, it seems a good idea to bin the "no_of_previous_bookings_not_canceled" feature into smaller groups.
We bin the no_of_previous_bookings_not_canceled feature into 3 groups of [0, 1], (1, 12], and (12, 72].
#Binning the variable into 3 bins
data['Binned_no_of_previous_bookings_not_canceled']=pd.cut(data['no_of_previous_bookings_not_canceled'],bins=[-1,1,12,72],labels=['[0, 1]','(1, 12]','(12, 72]'])
#Dropping the previous variable
data=data.drop(['no_of_previous_bookings_not_canceled'], axis=1)
Frequencies of each bin are displayed below:
data['Binned_no_of_previous_bookings_not_canceled'].value_counts()
#changing type of the variable to "object"
data['Binned_no_of_previous_bookings_not_canceled']=data['Binned_no_of_previous_bookings_not_canceled'].astype('object');
The histogram of the new variable, Binned_no_of_previous_bookings_not_canceled, is shown below:
#plotting
fig = plt.figure(figsize=(5, 4));
ax=sns.histplot(x='Binned_no_of_previous_bookings_not_canceled', data=data);
Note that we converted the type of Binned_no_of_previous_bookings_not_canceled from categorical to object above.
An outlier is a data point that is distant from other similar points. Regression models, including logistic regression, are easily impacted by outliers in the data: they can distort predictions and hurt accuracy, so it is important to flag them for review.
We use the IQR, the interval from the 1st quartile to the 3rd quartile of the data in question, and flag points for investigation if they fall more than 1.5 * IQR outside that interval.
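As a toy illustration of this rule (the numbers below are made up for the example, not taken from the hotel data):

```python
import numpy as np
import pandas as pd

# toy series with one obvious outlier, to illustrate the 1.5 * IQR rule
s = pd.Series([10, 12, 11, 13, 12, 14, 11, 100])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())  # only the extreme point 100 falls outside the whiskers
```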
Let us plot the boxplots of all numerical columns to display outliers.
plt.figure(figsize=(20, 30))
# numerical columns
columns=data.dtypes[data.dtypes != 'object'].index
# plot
for i, variable in enumerate(columns):
    plt.subplot(5, 4, i + 1)
    plt.boxplot(data[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)
Calculating the fraction of outliers for each variable:
The following function calculates the fraction of outliers for each numerical column based on the IQR.
#Creating a function to calculate fraction of outliers
def frac_outside_IQR(y):
    x = y.to_numpy(dtype=float)
    # 1.5 * IQR, the whisker length used in the boxplots above
    length = 1.5 * np.diff(np.quantile(x, [.25, .75]))
    # fraction of points farther than 1.5 * IQR from the median
    frac = round(np.mean(np.abs(x - np.median(x)) > length), 2)
    return frac
#Create list of quantitative variables
numeric_columns=data.dtypes[data.dtypes != 'object'].index.tolist()
print('\nfraction of outliers for quantitative variables:')
#Apply the frac_outside_IQR function on the numeric_columns in data set
Out_frac=data[numeric_columns].apply(frac_outside_IQR, axis=0)
Out_var_list=Out_frac[Out_frac>0]
Out_var_list
num_to_display = 20 # defining number of displayed values for each numeric feature
for colname in Out_var_list.index:
    val_counts = data[colname].value_counts(dropna=False)  # show NA counts
    print(f'\n\nnumerical variable = {colname}')  # f-string
    if len(val_counts) > num_to_display:
        print(f'Only displaying the first {num_to_display} of {len(val_counts)} values.\n')  # f-string
    print(val_counts.iloc[:num_to_display])
- no_of_adults: We have 184 rows corresponding to 0 adults, which does not make sense. The other values (1, 2, and 3) are sensible.
- no_of_children: We have 2 rows corresponding to 9 children, which is very different from the other values (0, 1, 2, 3). There are also some rows with 10 children.
- required_car_parking_space: This is a binary variable and needs no outlier treatment, as all its values are either 1 or 0.
- repeated_guest: This is a binary variable and needs no outlier treatment, as all its values are either 1 or 0.
- no_of_previous_cancellations: We have 25 rows corresponding to 11 previous cancellations, which does not make sense. There are also some rows with 13 cancellations. The other values range from 0 to 6, which is sensible.
- no_of_special_requests: Based on the IQR, the fraction of outliers for no_of_special_requests is 3%. However, its values range only from 0 to 6, so we see no need to treat this fraction as outliers and no treatment will be applied to this variable.
- no_of_week_nights: The values of no_of_week_nights cover a wide range. We will investigate it further to see if any outlier treatment is needed.
- lead_time: The values of lead_time cover a wide range. We will investigate it further to see if any outlier treatment is needed.
- avg_price_per_room: The values of avg_price_per_room cover a wide range. Hence, we use the IQR to treat its outliers.

no_of_adults: We change the number of adults from 0 to 1 to treat the 184 rows that correspond to 0 adults.
#changing the value of no_of_adults from 0 to 1
data.loc[data[data['no_of_adults']==0].index,'no_of_adults']=1
data['no_of_adults'].value_counts()
no_of_children: We have 2 rows with 9 children, which is very different from the other values (0, 1, 2, 3), and a few rows with 10 children. We assign these values to the maximum sensible number of children, which is 3.
#changing the outlier values of no_of_children to 3 children
data.loc[data[data['no_of_children']==9].index,'no_of_children']=3
data.loc[data[data['no_of_children']==10].index,'no_of_children']=3
data['no_of_children'].value_counts()
no_of_previous_cancellations: We change the values 11 and 13 to 6 to treat the corresponding rows (e.g., the 25 rows with 11 previous cancellations).
#changing the value of no_of_previous_cancellations from 11,13 to 6
data.loc[data[data['no_of_previous_cancellations']==11].index,'no_of_previous_cancellations']=6
data.loc[data[data['no_of_previous_cancellations']==13].index,'no_of_previous_cancellations']=6
data['no_of_previous_cancellations'].value_counts()
avg_price_per_room: From the graph below, we can see that the average price can be a good predictor of booking cancellations. The only values that appear to be outliers are those equal to 0 and those greater than 500.
sns.histplot(data=data, x='avg_price_per_room',hue='booking_status');
There are 641 rows in the data with a room price of zero, which does not make sense. There are also a few rows with a room price of 1. We will substitute these room prices with the median room price.
#changing the value of outlier to the median value
data.loc[data[data['avg_price_per_room']<5].index,'avg_price_per_room']=data['avg_price_per_room'].median()
sns.histplot(data=data, x='avg_price_per_room',hue='booking_status');
no_of_week_nights: The histogram of this variable for values of no_of_week_nights > 5 is plotted below. Using the IQR method, some of these points are flagged as outliers because they fall beyond the upper whisker. However, these points seem to carry valuable information, so we will not treat them as outliers.
sns.histplot(data=data[data['no_of_week_nights']>5], x='no_of_week_nights',hue='booking_status');
lead_time: We treat outliers of "lead_time" by flooring and capping as follows:
def treat_outliers_func(x):
    """
    Treats outliers in a numerical variable by flooring and capping
    x: pandas Series, the numerical column to treat
    """
    Q1 = x.quantile(0.25)  # 25th percentile
    Q3 = x.quantile(0.75)  # 75th percentile
    IQR = Q3 - Q1
    Lower_Whisker = Q1 - 1.5 * IQR
    Upper_Whisker = Q3 + 1.5 * IQR
    # values smaller than Lower_Whisker are floored to Lower_Whisker;
    # values greater than Upper_Whisker are capped at Upper_Whisker
    x = np.clip(x, Lower_Whisker, Upper_Whisker)
    return x
#Create list of variables to treat
treating_vars=['lead_time']
#Apply the treat_outliers_func function on these columns in the data set
data[treating_vars]=data[treating_vars].apply(treat_outliers_func, axis=0)
Boxplots of the revised numeric variables are shown below.
plt.figure(figsize=(15, 20))
# numerical columns
columns=['lead_time']
# plot
for i, variable in enumerate(columns):
    plt.subplot(5, 4, i + 1)
    plt.boxplot(data[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)
We change the values of booking_status which is our target variable and represent it with numeric values. If booking_status=0 it indicates that the current booking has not been canceled and if booking_status=1 it shows the booking has been canceled.
data["booking_status"] = data["booking_status"].apply(lambda x: 1 if x == "Canceled" else 0)
data.head()
print(f'There are {data.shape[1]} columns and {data.shape[0]} rows in the revised data set.') # f-string
The average, median, standard deviation, min, max, and 1st and 3rd quartiles of the quantitative variables are shown below:
data.describe()
#Create list of categorical variables
object_columns=data.dtypes[data.dtypes == 'object'].index.tolist()
#Creating dummy variables and one-hot encoding for categorical variables
data=pd.get_dummies(data, columns = object_columns, drop_first=True)
data.head()
x=data.drop(['booking_status'], axis=1)
y=data[['booking_status']]
We'll split the data into train and test to be able to evaluate the model that we build on the train data.
# splitting the data in 70:30 ratio for train to test data
x_train1, x_test1, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)
print("Number of rows in train data =", x_train1.shape[0]);
print("Number of rows in test data =", x_test1.shape[0]);
print("Number (percentage) of bookings in training set:")
print("canceled    : {0} ({1:2.2f}%)".format(y_train['booking_status'].value_counts()[1], y_train['booking_status'].value_counts(normalize=True)[1] * 100 ))
print("non-canceled: {0} ({1:2.2f}%)".format(y_train['booking_status'].value_counts()[0], y_train['booking_status'].value_counts(normalize=True)[0] * 100 ))
print("\nNumber (percentage) of bookings in test set:")
print("canceled    : {0} ({1:2.2f}%)".format(y_test['booking_status'].value_counts()[1], y_test['booking_status'].value_counts(normalize=True)[1] * 100 ))
print("non-canceled: {0} ({1:2.2f}%)".format(y_test['booking_status'].value_counts()[0], y_test['booking_status'].value_counts(normalize=True)[0] * 100 ))
The percentage of canceled and non-canceled bookings in the training and test data sets are almost equal. Hence, both data sets have a good distribution for booking status.
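The near-equal split above relies on random sampling. As a side note, train_test_split also accepts a stratify argument that enforces the class ratio exactly; a small sketch on toy data (the arrays here are illustrative, not the hotel data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 toy samples with a 30% positive class, similar to a cancellation rate
X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 30 + [0] * 70)

# stratify=y keeps the 30/70 class ratio in both the train and test splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
print(y_tr.mean(), y_te.mean())  # both class ratios are 0.30
```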
In our model, y = 1 indicates that the booking was canceled and y = 0 that it was not. We aim to build a logistic regression model to make predictions and classify data points. The output of the model is a number between 0 and 1 representing the predicted probability of cancellation for each data point. Later, we will define a threshold on these predicted probabilities to classify the data points.
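As a minimal sketch of this classification step (the probabilities and the 0.5 cut-off below are made-up placeholders; the actual threshold is chosen later in the analysis):

```python
import numpy as np

# hypothetical predicted probabilities of cancellation from a fitted model
probs = np.array([0.10, 0.45, 0.52, 0.80, 0.97])

threshold = 0.5  # placeholder cut-off
pred_class = (probs > threshold).astype(int)
print(pred_class)  # probabilities above the threshold map to class 1 (canceled)
```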
Both the cases are important as:
If we predict a booking will be canceled but in reality it is not, the guest will show up at the hotel and demand their reserved room, which, due to our wrong prediction, might not be available. This causes great inconvenience for the guest and the hotel staff, damages the hotel's reputation, and affects future revenue.
On the contrary, if we predict a booking will not be canceled but in reality it is, we keep the room available for a guest who does not show up, which constitutes an opportunity loss of revenue.
The f1_score should be maximized: the greater the f1_score, the higher the chance of identifying both classes correctly.
# defining a function to compute different metrics to check performance of a classification model built using statsmodels
def model_performance_classification_statsmodels(
    model, predictors, target, threshold
):
    """
    Function to compute different metrics to check classification model performance
    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """
    # checking which probabilities are greater than the threshold
    pred_temp = model.predict(predictors) > threshold
    # converting the boolean values to classes 0/1
    pred = np.round(pred_temp)
    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score
    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1},
        index=[0],
    )
    return df_perf
# defining a function to plot the confusion_matrix of a classification model
def confusion_matrix_statsmodels(model, predictors, target, threshold):
    """
    To plot the confusion_matrix with percentages
    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """
    y_pred = model.predict(predictors)
    y_pred = y_pred.apply(lambda x: 1 if x >= threshold else 0)
    cm = confusion_matrix(target, y_pred)
    # build "count percentage" labels for each cell
    labels = np.asarray(
        [
            ["{0:0.0f} ".format(item) + "{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="", annot_kws={"va": "bottom"})
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
In order to make statistical inferences from a logistic regression model, it is important to ensure that there is no multicollinearity present in the data.
Multicollinearity occurs when predictor variables in a regression model are correlated. This correlation is a problem because predictor variables should be independent. If the correlation between variables is high, it can cause problems when we fit the model and interpret the results. When we have multicollinearity in the linear model, the coefficients that the model suggests are unreliable.
There are different ways of detecting (or testing) multicollinearity. One such way is by using the Variance Inflation Factor, or VIF.
Variance Inflation Factor (VIF): Variance inflation factors measure the inflation in the variances of the regression parameter estimates due to collinearities that exist among the predictors. It is a measure of how much the variance of the estimated regression coefficient $\beta_k$ is "inflated" by the existence of correlation among the predictor variables in the model.
General rule of thumb: a VIF of 1 indicates no correlation, values between 1 and 5 indicate moderate correlation, and a VIF greater than 5 signals high multicollinearity that should be addressed.
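The "inflation" can be made concrete: for predictor $k$, VIF$_k = 1/(1 - R_k^2)$, where $R_k^2$ is obtained by regressing that predictor on all the others. A minimal numpy sketch on toy data illustrates this (the helper name vif_from_r2 and the data are ours, not from the report):

```python
import numpy as np

def vif_from_r2(X, k):
    """VIF of column k: regress X[:, k] on the remaining columns
    and apply VIF_k = 1 / (1 - R^2_k)."""
    y = X[:, k]
    others = np.delete(X, k, axis=1)
    A = np.column_stack([np.ones(len(X)), others])  # intercept + other predictors
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + 0.1 * rng.normal(size=500)  # nearly collinear with x1
x3 = rng.normal(size=500)             # independent of the others
X = np.column_stack([x1, x2, x3])
print(vif_from_r2(X, 0))  # large: x1 is almost fully explained by x2
print(vif_from_r2(X, 2))  # close to 1: x3 is uncorrelated
```

A VIF near 100 for x1 and near 1 for x3 matches the rule of thumb above.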
Using the following function, we calculate the VIF for each predictor variable.
#defining a function to check VIF
def checking_vif(x):
vif = pd.DataFrame()
vif["feature"] = x.columns
# calculating VIF for each feature
vif["VIF"] = [
variance_inflation_factor(x.values, i)
for i in range(len(x.columns))
]
return vif
df_VIF=checking_vif(data)
df_VIF
Listing variables with VIF greater than 5.
df_VIF[df_VIF['VIF']>5]
The VIF values for dummy variables can be ignored. The variables "no_of_adults", "arrival_year", "arrival_month", "avg_price_per_room", "market_segment_type_Corporate", "market_segment_type_Offline", "market_segment_type_Online", and "Binned_no_of_previous_bookings_not_canceled_[0, 1]" have VIF greater than 5.
To remove multicollinearity
Let's first define a function to evaluate model performance when each of the high-VIF variables is dropped in turn.
def treating_multicollinearity(predictors, target, high_vif_columns):
"""
Checking the effect of dropping the columns showing high multicollinearity
on model performance (Accuracy, Recall, Precision, and F1)
predictors: independent variables
target: dependent variable
high_vif_columns: columns having high VIF
"""
# empty lists to store performance measures
Recall = []
Accuracy = []
F1 = []
Precision=[]
# build the models by dropping one of the high VIF columns at a time
# store the performance measures in the lists defined previously
for cols in high_vif_columns:
# defining the new train set without the high-VIF column
train = predictors.loc[:, ~predictors.columns.str.startswith(cols)]
# build the model on the reduced predictor set
lg = LogisticRegression(solver="newton-cg", random_state=1)
model = lg.fit(train, target)
# create the dataframe including performance measures
Data_performance = model_performance_classification_statsmodels(model, train, target, threshold=0.5)
# adding Accuracy, Recall, Precision, and F1 to the lists
Accuracy.append(Data_performance.iloc[0,0])
Recall.append(Data_performance.iloc[0,1])
Precision.append(Data_performance.iloc[0,2])
F1.append(Data_performance.iloc[0,3])
# creating a dataframe for the results
temp = pd.DataFrame(
{
"col": high_vif_columns,
"Accuracy": Accuracy,
"Recall": Recall,
"Precision": Precision,
"F1": F1,
}
).sort_values(by="F1", ascending=False)
temp.reset_index(drop=True, inplace=True)
return temp
#List of variables with VIF greater than 5
col_list = df_VIF[df_VIF['VIF']>5]['feature'].tolist()
#Performance measures
Performances = treating_multicollinearity(x_train1, y_train, col_list)
Performances
From the table above, we can see that dropping any one of the high-VIF variables has a similar influence on the predictive power of the model. Hence, we drop market_segment_type_Online because it has the greatest VIF value.
col_to_drop = "market_segment_type_Online"
#Dropping the column
x_train2 = x_train1.loc[:, ~x_train1.columns.str.startswith(col_to_drop)]
x_test2 = x_test1.loc[:, ~x_test1.columns.str.startswith(col_to_drop)]
# Check VIF now
df_VIF = checking_vif(x_train2)
print("Variables with VIF>5 after dropping variable", col_to_drop)
df_VIF[df_VIF['VIF']>5]
The variables "no_of_adults", "arrival_year", "arrival_month", "avg_price_per_room", and "Binned_no_of_previous_bookings_not_canceled_[0, 1]" have VIF greater than 5.
#List of variables with VIF greater than 5
col_list = df_VIF[df_VIF['VIF']>5]['feature'].tolist()
#Performance measures
Performances = treating_multicollinearity(x_train2, y_train, col_list)
Performances
From the table above, we can see that dropping any one of the high-VIF variables has a similar influence on the predictive power of the model. Hence, we drop Binned_no_of_previous_bookings_not_canceled_[0, 1] because it has the greatest VIF value.
col_to_drop = "Binned_no_of_previous_bookings_not_canceled_[0, 1]"
x_train3 = x_train2.loc[:, ~x_train2.columns.str.startswith(col_to_drop)]
x_test3 = x_test2.loc[:, ~x_test2.columns.str.startswith(col_to_drop)]
# Check VIF now
df_VIF= checking_vif(x_train3)
print("Variables with VIF>5 after dropping variable", col_to_drop)
df_VIF[df_VIF['VIF']>5]
We have two quantitative variables with VIF greater than 5.
#List of variables with VIF greater than 5
col_list = df_VIF[df_VIF['VIF']>5]['feature'].tolist()
#Performance measures
res = treating_multicollinearity(x_train3, y_train, col_list)
res
We will drop arrival_year.
col_to_drop = "arrival_year"
x_train4 = x_train3.loc[:, ~x_train3.columns.str.startswith(col_to_drop)]
x_test4 = x_test3.loc[:, ~x_test3.columns.str.startswith(col_to_drop)]
# Check VIF now
df_VIF= checking_vif(x_train4)
print("Variables with VIF>5 after dropping variable", col_to_drop)
df_VIF[df_VIF['VIF']>5]
We will drop no_of_adults and check the VIF again.
col_to_drop = "no_of_adults"
x_train5 = x_train4.loc[:, ~x_train4.columns.str.startswith(col_to_drop)]
x_test5 = x_test4.loc[:, ~x_test4.columns.str.startswith(col_to_drop)]
# Check VIF now
df_VIF= checking_vif(x_train5)
print("Variables with VIF>5 after dropping variable", col_to_drop)
df_VIF[df_VIF['VIF']>5]
We will drop avg_price_per_room and check the VIF again.
col_to_drop = "avg_price_per_room"
x_train6 = x_train5.loc[:, ~x_train5.columns.str.startswith(col_to_drop)]
x_test6 = x_test5.loc[:, ~x_test5.columns.str.startswith(col_to_drop)]
# Check VIF now
df_VIF= checking_vif(x_train6)
print("VIF after dropping variable", col_to_drop)
df_VIF
The remaining quantitative predictors show no multicollinearity, so the assumption is satisfied.
Since we have dropped multiple columns to eliminate multicollinearity, we need to check the training set for duplicates again.
print('There are {} duplicate rows in x_train6 that need to be treated.'.format(x_train6.duplicated().sum()))
Dropping the duplicated rows in both x_train and y_train:
# finding the indexes of the duplicate rows
dup_index=x_train6.loc[x_train6.duplicated(), :].index.tolist()
# dropping duplicates from the train predictors (the test set is unchanged)
x_train7=x_train6.drop(index=dup_index, axis=0)
x_test7=x_test6
# dropping the corresponding rows from the train target
y_train2=y_train.drop(index=dup_index, axis=0)
y_test2=y_test
print('There are {} duplicate rows left in x_train7.'.format(x_train7.duplicated().sum()))
# There are different solvers available in Sklearn logistic regression
# The newton-cg solver is faster for high-dimensional data
lg = LogisticRegression(solver="newton-cg", random_state=1)
model = lg.fit(x_train7, y_train2)
# adding constant
x_train8 = sm.add_constant(x_train7)
x_test8 = sm.add_constant(x_test7)
Our final training and testing sets for the predictors (X) are x_train8 and x_test8, and for the target (y), y_train2 and y_test2.
# fitting logistic regression model
logit = sm.Logit(y_train2, x_train8.astype(float))
lg = logit.fit(disp=False)
print(lg.summary())
Observations
Note: Since multicollinearity has been removed from the data, the model coefficients and p-values are reliable.
A positive coefficient indicates that the probability of booking cancellation increases as the corresponding attribute value increases.
A negative coefficient indicates that the probability of booking cancellation decreases as the corresponding attribute value increases.
The p-value of a variable indicates whether the variable is significant. If we take the significance level to be 0.05 (5%), then any variable with a p-value less than 0.05 is considered significant. For instance, the p-value of "type_of_meal_plan_Meal Plan 3" equals 1, so it is considered insignificant.
We will drop insignificant variables one by one, repeating the following:
# running a loop to drop variables with high p-value
# initial list of columns
cols = x_train8.columns.tolist()
# setting an initial max p-value
max_p_value = 1
while len(cols) > 0:
# defining the train set
x_train_aux = x_train8[cols]
# fitting the model
model = sm.Logit(y_train2, x_train_aux).fit(disp=False)
# getting the p-values and the maximum p-value
p_values = model.pvalues
max_p_value = max(p_values)
# name of the variable with maximum p-value
feature_with_p_max = p_values.idxmax()
if max_p_value > 0.05:
cols.remove(feature_with_p_max)
else:
break
selected_features = cols
print(selected_features)
We extract the data corresponding to the selected features defined above.
x_train9 = x_train8[selected_features]
x_test9 = x_test8[selected_features]
We rebuild the regression model using statsmodels.
logit = sm.Logit(y_train2, x_train9.astype(float))
lg2 = logit.fit(disp=False)
print(lg2.summary())
Now none of the features has a p-value greater than 0.05, so we'll take the features in x_train9 as the final features for classification and lg2 as our final model.
print("Training performance:")
train_sklearn =model_performance_classification_statsmodels(lg2, x_train9, y_train2, threshold=0.5)
train_sklearn
# creating confusion matrix
confusion_matrix_statsmodels(lg2, x_train9, y_train2, threshold=0.5)
print("Test performance:")
test_sklearn= model_performance_classification_statsmodels(lg2, x_test9, y_test2, threshold=0.5)
test_sklearn
# creating confusion matrix
confusion_matrix_statsmodels(lg2, x_test9, y_test2, threshold=0.5)
logit_roc_auc_train = roc_auc_score(y_train2, lg2.predict(x_train9))
fpr, tpr, thresholds = roc_curve(y_train2, lg2.predict(x_train9))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver Operating Characteristic- Training Set")
plt.legend(loc="lower right")
plt.show()
logit_roc_auc_test = roc_auc_score(y_test2, lg2.predict(x_test9))
fpr, tpr, thresholds = roc_curve(y_test2, lg2.predict(x_test9))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_test)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver Operating Characteristic- Test Set")
plt.legend(loc="lower right")
plt.show()
# Optimal threshold as per AUC-ROC curve
# The optimal cut off would be where TPR is high and FPR is low
fpr, tpr, thresholds = roc_curve(y_train2, lg2.predict(x_train9))
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold_auc_roc = thresholds[optimal_idx]
print(optimal_threshold_auc_roc)
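The quantity being maximized above, TPR − FPR, is Youden's J statistic. A self-contained sketch on synthetic scores (all names and data here are hypothetical, not from the report) shows the same cutoff selection:

```python
import numpy as np

# synthetic scores: class-1 observations tend to score higher
rng = np.random.default_rng(1)
y_true = np.r_[np.zeros(200), np.ones(200)]
scores = np.r_[rng.normal(0.3, 0.15, 200), rng.normal(0.7, 0.15, 200)]

def tpr_fpr(y, s, t):
    """True- and false-positive rates at threshold t."""
    pred = s >= t
    tpr = (pred & (y == 1)).sum() / (y == 1).sum()
    fpr = (pred & (y == 0)).sum() / (y == 0).sum()
    return tpr, fpr

# Youden's J = TPR - FPR; the ROC-optimal cutoff maximizes it
thresholds = np.linspace(0, 1, 101)
j_stats = []
for t in thresholds:
    tpr, fpr = tpr_fpr(y_true, scores, t)
    j_stats.append(tpr - fpr)
best_threshold = thresholds[int(np.argmax(j_stats))]
print(best_threshold)  # lies between the two class means, near their crossing point
```

The chosen cutoff sits roughly where the two class score distributions cross, which is exactly the "high TPR, low FPR" corner of the ROC curve.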
print("\nTraining performance with optimum threshold:")
train_threshold_roc_train= model_performance_classification_statsmodels(lg2, x_train9, y_train2, threshold=optimal_threshold_auc_roc)
train_threshold_roc_train
# creating confusion matrix
confusion_matrix_statsmodels(lg2, x_train9, y_train2, threshold=optimal_threshold_auc_roc)
print("\nTest performance with optimum threshold:")
test_threshold_roc_train= model_performance_classification_statsmodels(lg2, x_test9, y_test2, threshold=optimal_threshold_auc_roc)
test_threshold_roc_train
# creating confusion matrix
confusion_matrix_statsmodels(lg2, x_test9, y_test2, threshold=optimal_threshold_auc_roc)
# Function to plot Precision-Recall Curve
def plot_prec_recall_vs_tresh(precisions, recalls, thresholds):
plt.plot(thresholds, precisions[:-1], "b--", label="precision")
plt.plot(thresholds, recalls[:-1], "g--", label="recall")
plt.xlabel("Threshold")
plt.legend(loc="upper left")
plt.ylim([0, 1])
#predicting y
y_scores = lg2.predict(x_train9)
#Calculating Precision and Recall for different thresholds
prec, rec, tre = precision_recall_curve(y_train2, y_scores,)
#Plotting the Precision_Recall Curve
plt.figure(figsize=(10, 7))
plt.title("Precision-Recall Curve- Training Set")
plot_prec_recall_vs_tresh(prec, rec, tre)
#predicting y
y_scores_test = lg2.predict(x_test9)
#Calculating Precision and Recall for different thresholds
prec, rec, tre = precision_recall_curve(y_test2, y_scores_test,)
#Plotting the Precision_Recall Curve
plt.figure(figsize=(10, 7))
plt.title("Precision-Recall Curve- Test Set")
plot_prec_recall_vs_tresh(prec, rec, tre)
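The balanced threshold used next (0.42 in the report) is read off where the precision and recall curves cross. The crossover can also be located programmatically; a sketch on synthetic scores (all names and data hypothetical):

```python
import numpy as np

# synthetic predicted probabilities for a balanced toy problem
rng = np.random.default_rng(2)
y = np.r_[np.zeros(300), np.ones(300)]
scores = np.r_[rng.beta(2, 5, 300), rng.beta(5, 2, 300)]

def prec_rec(y, s, t):
    """Precision and recall when predicting class 1 for scores >= t."""
    pred = s >= t
    tp = (pred & (y == 1)).sum()
    precision = tp / max(pred.sum(), 1)
    recall = tp / (y == 1).sum()
    return precision, recall

# scan thresholds for the point where the two curves are closest
thresholds = np.linspace(0.01, 0.99, 99)
gaps = [abs(p - r) for p, r in (prec_rec(y, scores, t) for t in thresholds)]
balanced = thresholds[int(np.argmin(gaps))]
print(balanced)
```

At the crossover, precision equals recall, which happens when the number of predicted positives matches the number of actual positives.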
# setting the threshold
optimal_threshold_curve = 0.42
print("\nTraining performance with optimum threshold:")
train_threshold_curve_train= model_performance_classification_statsmodels(lg2, x_train9, y_train2, threshold=optimal_threshold_curve)
train_threshold_curve_train
# creating confusion matrix
confusion_matrix_statsmodels(lg2, x_train9, y_train2, threshold=optimal_threshold_curve)
print("\nTest performance with optimum threshold:")
test_threshold_curve_train= model_performance_classification_statsmodels(lg2, x_test9, y_test2, threshold=optimal_threshold_curve)
test_threshold_curve_train
# creating confusion matrix
confusion_matrix_statsmodels(lg2, x_test9, y_test2, threshold=optimal_threshold_curve)
# training performance comparison
models_train_comp_df = pd.concat(
[
train_sklearn.T,
test_sklearn.T,
train_threshold_roc_train.T,
test_threshold_roc_train.T,
train_threshold_curve_train.T,
test_threshold_curve_train.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Lg model on training set- Threshold= Default",
"Lg model on test set- Threshold= Default",
"Lg model on training set- Threshold= 0.33",
"Lg model on test set- Threshold= 0.33",
"Lg model on training set- Threshold= 0.42",
"Lg model on test set- Threshold= 0.42",
]
print("\nTraining performance comparison:")
models_train_comp_df.T
The features in x_train9 are considered the final train set, and lg2 is our final model.
Now, we can move towards the prediction part.
# predictions on the test set
pred = lg2.predict(x_test9)
df_pred_test = pd.DataFrame({"Actual": y_test2['booking_status'], "Predicted (probability)": pred})
df_pred_test.head(10)
In the above table, the predicted values indicate the probability of booking cancellation (class 1).
For instance, in the first row the actual value of y is 0, indicating that the booking was not canceled. Our predicted value for y is 0.11, meaning that, based on the proposed logistic regression model, the probability of booking cancellation is predicted to be 11%.
We can also visualize and compare the actual values and the predicted probabilities as a bar graph below:
df1 = df_pred_test.head(25)
df1.plot(kind="bar", figsize=(15, 7));
Now, we apply threshold=0.42 to see the results of the predictions:
df_pred_test["Predicted Value"] = df_pred_test["Predicted (probability)"].apply(lambda x: 1 if x >= 0.42 else 0)
df_pred_test
df1 = df_pred_test.head(25).drop('Predicted (probability)',axis=1)
df1.plot(kind="bar", figsize=(15, 7));
If the frequency of class A is 10% and the frequency of class B is 90%, then class B becomes the dominant class and the decision tree becomes biased toward it.
In this case, we can pass a dictionary {0: 0.15, 1: 0.85} to the model to specify the weight of each class, and the decision tree will give more weight to class 1.
class_weight is a hyperparameter of the decision tree classifier.
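The effect of these weights on a tree's leaf votes can be seen on a tiny synthetic example (toy numbers, not the hotel data): a leaf switches to predicting class 1 as soon as the weighted class counts favour it.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# toy data: region A (x=0) has 5% positives, region B (x=1) has 30%
X = np.array([[0]] * 100 + [[1]] * 100)
y = np.array([1] * 5 + [0] * 95 + [1] * 30 + [0] * 70)

plain = DecisionTreeClassifier(random_state=1).fit(X, y)
weighted = DecisionTreeClassifier(
    class_weight={0: 0.15, 1: 0.85}, random_state=1
).fit(X, y)

# a leaf predicts class 1 once the weighted votes favour it:
# region B flips because 30 * 0.85 > 70 * 0.15
print(plain.predict([[1]])[0])     # 0 - unweighted majority vote
print(weighted.predict([[1]])[0])  # 1 - upweighted minority wins
```

With the weights above, any leaf with more than 15% positives is pushed toward predicting class 1, which is why the weighted tree catches far more cancellations at the cost of some precision.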
model = DecisionTreeClassifier(
criterion="gini", class_weight={0: 0.15, 1: 0.85}, random_state=1)
model.fit(x_train9, y_train2);
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_decision_tree(
model, predictors, target
):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting the classes directly (no probability threshold needed here)
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
index=[0],
)
return df_perf
def confusion_matrix_sklearn(model, predictors, target):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
"""
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f} ".format(item) + "{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
confusion_matrix_sklearn(model, x_train9, y_train2)
plt.title('confusion matrix for training data');
decision_tree_perf_train1 = model_performance_classification_decision_tree(model, x_train9, y_train2)
decision_tree_perf_train1
confusion_matrix_sklearn(model, x_test9, y_test2)
plt.title('confusion matrix for test data');
decision_tree_perf_test1 = model_performance_classification_decision_tree(model, x_test9, y_test2)
decision_tree_perf_test1
Since the model is highly overfitting, we need to prune the tree to stop it from capturing the noise of the training data.
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1, class_weight={0: 0.15, 1: 0.85})
# Grid of parameters to choose from
parameters = {
"max_depth": [5, 10, 20, None],
"criterion": ["entropy", "gini"],
"splitter": ["best", "random"],
"min_impurity_decrease": [0.00004, 0.0001, 0.01],
}
# Type of scoring used to compare parameter combinations
scorer = make_scorer(f1_score)
# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=scorer, cv=5);
grid_obj = grid_obj.fit(x_train9, y_train2)
# Set the clf to the best combination of parameters
model_2 = grid_obj.best_estimator_
# Fit the best algorithm to the data.
model_2.fit(x_train9, y_train2);
Confusion matrix for training set is:
confusion_matrix_sklearn(model_2, x_train9, y_train2)
plt.title('confusion matrix for training data');
decision_tree_perf_train2 = model_performance_classification_decision_tree(model_2, x_train9, y_train2)
decision_tree_perf_train2
confusion_matrix_sklearn(model_2, x_test9, y_test2)
plt.title('confusion matrix for test data');
decision_tree_perf_test2 = model_performance_classification_decision_tree(model_2, x_test9, y_test2)
decision_tree_perf_test2
# creating a list of column names
feature_names = x_train9.columns.to_list()
# plotting the decision tree
plt.figure(figsize=(20, 30))
out = tree.plot_tree(
model_2,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(model_2, feature_names=feature_names, show_weights=True))
# Displaying important features in tree
importances = model_2.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
The DecisionTreeClassifier provides parameters such as
min_samples_leaf and max_depth to prevent a tree from overfitting. Cost
complexity pruning provides another option to control the size of a tree. In
DecisionTreeClassifier, this pruning technique is parameterized by the
cost complexity parameter, ccp_alpha. Greater values of ccp_alpha
increase the number of nodes pruned. Here we only show the effect of
ccp_alpha on regularizing the trees and how to choose a ccp_alpha
based on validation scores.
Minimal cost complexity pruning recursively finds the node with the "weakest
link". The weakest link is characterized by an effective alpha, where the
nodes with the smallest effective alpha are pruned first. To get an idea of
what values of ccp_alpha could be appropriate, scikit-learn provides
DecisionTreeClassifier.cost_complexity_pruning_path that returns the
effective alphas and the corresponding total leaf impurities at each step of
the pruning process.
model_3 = DecisionTreeClassifier(random_state=1, class_weight={0: 0.15, 1: 0.85})
path = model_3.cost_complexity_pruning_path(x_train9, y_train2)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
pd.DataFrame(path).head()
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
As alpha increases, more of the tree is pruned, which increases the total impurity of its leaves. The last value
in ccp_alphas is the alpha value that prunes the whole tree, leaving a tree with a single node. Hence, we remove it.
# removing the last alpha and any negative values
ccp_alphas = ccp_alphas[:-1]
ccp_alphas=np.delete(ccp_alphas, np.where(ccp_alphas < 0))
Next, we train a decision tree using the effective alphas.
models_Pool = []
for ccp_alpha in ccp_alphas:
model = DecisionTreeClassifier(
random_state=1, ccp_alpha=ccp_alpha, class_weight={0: 0.15, 1: 0.85}
)
model.fit(x_train9, y_train2)
models_Pool.append(model)
print(
"Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
models_Pool[-1].tree_.node_count, ccp_alphas[-1]
)
)
Below we show that the number of nodes and the tree depth decrease as alpha increases.
node_counts = [model.tree_.node_count for model in models_Pool]
depth = [model.tree_.max_depth for model in models_Pool]
# plotting
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
Calculating Recall and Precision for the training data:
recall_train = []
precision_train=[]
for model in models_Pool:
pred_train = model.predict(x_train9)
values_train1 = recall_score(y_train2, pred_train)# to compute Recall
values_train2 = precision_score(y_train2, pred_train)# to compute Precision
recall_train.append(values_train1)
precision_train.append(values_train2)
Calculating Recall and Precision for test data:
recall_test = []
precision_test=[]
for model in models_Pool:
pred_test = model.predict(x_test9)
values_test1 = recall_score(y_test2, pred_test)# to compute Recall
values_test2 = precision_score(y_test2, pred_test)# to compute Precision
recall_test.append(values_test1)
precision_test.append(values_test2)
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(
ccp_alphas, recall_train, marker="o", label="recall_train", drawstyle="steps-post",
)
ax.plot(ccp_alphas, recall_test, marker="o", label="recall_test", drawstyle="steps-post")
ax.legend()
plt.show()
When alpha increases, at first Recall increases but as alpha passes 0.009 we see a sharp drop in the Recall values.
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Precision")
ax.set_title("Precision vs alpha for training and testing sets")
ax.plot(
ccp_alphas, precision_train, marker="o", label="precision_train", drawstyle="steps-post",
)
ax.plot(ccp_alphas, precision_test, marker="o", label="precision_test", drawstyle="steps-post")
ax.legend()
plt.show()
As alpha increases, the Precision value drops significantly.
Considering both Recall and Precision, the appropriate alpha should be chosen by maximizing Precision, since Recall remains at an acceptable level over a wide range of alpha values.
Next, we find a model with the best amount of alpha:
# Finding the model where we get highest Precision on the test set
index_best_model = np.argmax(precision_test)
model_3 = models_Pool[index_best_model]
print(model_3)
model_3.fit(x_train9, y_train2);
Confusion matrix for training set is:
confusion_matrix_sklearn(model_3, x_train9, y_train2)
plt.title('confusion matrix for training data');
decision_tree_perf_train3 = model_performance_classification_decision_tree(model_3, x_train9, y_train2)
decision_tree_perf_train3
confusion_matrix_sklearn(model_3, x_test9, y_test2)
plt.title('confusion matrix for test data');
decision_tree_perf_test3 = model_performance_classification_decision_tree(model_3, x_test9, y_test2)
decision_tree_perf_test3
# creating a list of column names
feature_names = x_train9.columns.to_list()
# plotting the decision tree
plt.figure(figsize=(20, 30))
out = tree.plot_tree(
model_3,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(model_3, feature_names=feature_names, show_weights=True))
# Displaying important features in tree
importances = model_3.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Performance Summary for the Full, Pre-Pruned, and Post-Pruned Trees:
# training performance comparison
DT_models_train_comp_df = pd.concat(
[
decision_tree_perf_train1.T,
decision_tree_perf_train2.T,
decision_tree_perf_train3.T,
decision_tree_perf_test1.T,
decision_tree_perf_test2.T,
decision_tree_perf_test3.T,
],
axis=1,
)
DT_models_train_comp_df.columns = [
"Full_Tree_Data_Training",
"Pre_Prunned_Data_Training",
"Post_Prunned_Data_Training",
"Full_Tree_Data_Test",
"Pre_Prunned_Data_Test",
"Post_Prunned_Data_Test",
]
print("\nperformance comparison of all three decision tree models:")
DT_models_train_comp_df.T
Now, we use the Post-Pruned decision tree model (model_3) and move towards the prediction part.
# predictions on the test set
pred = model_3.predict(x_test9)
df_pred_test = pd.DataFrame({"Actual": y_test2['booking_status'], "Predicted": pred})
df_pred_test.head(10)
We can also visualize and compare the actual and predicted values as a bar graph below:
df1 = df_pred_test.head(25)
df1.plot(kind="bar", figsize=(15, 7));
Performance comparison of the chosen Decision Tree model and the chosen Logistic Regression model:
print("\nperformance measures for the chosen regression model:")
LG=models_train_comp_df[['Lg model on training set- Threshold= 0.42','Lg model on test set- Threshold= 0.42']].T
LG
print("\nperformance measures of the chosen decision tree model:")
DT=DT_models_train_comp_df[['Post_Prunned_Data_Training','Post_Prunned_Data_Test']].T
DT
We plot the performance measures of the regression and the decision tree models below:
T=pd.concat([LG,DT])
T.T.plot(kind="bar", figsize=(10, 5));
Since we are interested in a balance between Recall and Precision, the threshold of 0.42 serves us best.
We have also built a Full decision tree, a Pre-Pruned decision tree, and a Post-Pruned decision tree for classification.
When using the Pre-Pruned tree, lead_time is the most important variable for predicting booking cancellations. After lead_time, the important variables are no_of_special_requests, arrival_month, no_of_week_nights, and market_segment_type_Offline.
The decision tree performs better than the logistic regression model on the training set. However, the performance of the logistic regression model exceeds that of the decision tree on the test data.
According to the decision tree model, the most important feature influencing booking cancellation is lead_time, the number of days between the date of booking and the arrival date. The next most important features are:
no_of_special_requests: Total number of special requests made by the customer (e.g. high floor, view from the room, etc)
arrival_month: Month of arrival date
no_of_week_nights: Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
market_segment_type_Offline: Market segment designation with Offline type
Based on the tree classification, we can classify new room bookings and check how they contribute to possible booking cancellations. For instance:
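As an illustration of the idea (on a synthetic toy rule, since model_3 and the real data are not reproduced here), a new booking can be pushed through a fitted tree to flag its cancellation risk:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-ins for two of the important features (hypothetical values)
rng = np.random.default_rng(3)
n = 1000
lead_time = rng.integers(0, 400, n)
special_requests = rng.integers(0, 4, n)
# toy pattern: long lead times with no special requests tend to cancel
canceled = ((lead_time > 150) & (special_requests == 0)).astype(int)

X = np.column_stack([lead_time, special_requests])
clf = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X, canceled)

# classifying a hypothetical new booking: 200-day lead time, no special requests
print(clf.predict(np.array([[200, 0]]))[0])  # flagged as likely to cancel
print(clf.predict(np.array([[10, 2]]))[0])   # short lead time with requests: kept
```

In practice the fitted Post-Pruned tree (model_3) would take the same role, scoring each incoming booking on the full final feature set.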
This information helps Star Hotels managers recognize guests with a higher possibility of cancellation and take appropriate action. For instance, if a booking does not seem promising, the room can be offered to other potential guests; a waiting list can be prepared for these situations.
The hotel can use these insights to define a proper booking cancellation and refund policy. From the logistic model we learn that as the arrival time or the number of special requests increases, the probability of booking cancellation decreases. When market_segment_type is not Offline, the possibility of booking cancellation also decreases. If new customers do not seem to contribute to opportunity loss (for instance, when market_segment_type is not Offline), the hotel can set policies that ease refunding. This can help the business by attracting more customers to the hotel. On the contrary, if new customers may put the hotel's revenue at risk and contribute to opportunity loss, the hotel can set stricter cancellation policies. This can help reduce the possibility of cancellations by customers.